Monocular 3D Object Detection with Pseudo-LiDAR Point Cloud
Monocular 3D scene understanding tasks, such as object size estimation,
heading angle estimation and 3D localization, are challenging. Successful modern
day methods for 3D scene understanding require the use of a 3D sensor. On the
other hand, single image based methods have significantly worse performance. In
this work, we aim at bridging the performance gap between 3D sensing and 2D
sensing for 3D object detection by enhancing LiDAR-based algorithms to work
with single image input. Specifically, we perform monocular depth estimation
and lift the input image to a point cloud representation, which we call
pseudo-LiDAR point cloud. Then we can train a LiDAR-based 3D detection network
with our pseudo-LiDAR end-to-end. Following the pipeline of two-stage 3D
detection algorithms, we detect 2D object proposals in the input image and
extract a point cloud frustum from the pseudo-LiDAR for each proposal. Then an
oriented 3D bounding box is detected for each frustum. To handle the large
amount of noise in the pseudo-LiDAR, we propose two innovations: (1) use a
2D-3D bounding box consistency constraint, adjusting the predicted 3D bounding
box to have a high overlap with its corresponding 2D proposal after projecting
onto the image; (2) use the instance mask instead of the bounding box as the
representation of 2D proposals, in order to reduce the number of points not
belonging to the object in the point cloud frustum. Through our evaluation on
the KITTI benchmark, we achieve the top-ranked performance on both bird's eye
view and 3D object detection among all monocular methods, effectively
quadrupling the performance over previous state-of-the-art. Our code is
available at https://github.com/xinshuoweng/Mono3D_PLiDAR.
Comment: Camera ready for the ICCV Workshop on "Road Scene Understanding and Autonomous Driving".
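The lifting step described above can be made concrete with a short sketch: every pixel of the predicted depth map is back-projected into 3D using the camera intrinsics, and the resulting points form the pseudo-LiDAR cloud. This is a minimal illustration under a standard pinhole camera model, not the authors' released code; the intrinsic parameters (fx, fy, cx, cy) are assumed inputs.

```python
# Minimal sketch: lift a monocular depth map to a pseudo-LiDAR point cloud by
# back-projecting every pixel with the pinhole intrinsics (assumed inputs).
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """depth: (H, W) array of metric depths from a monocular depth estimator."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))       # pixel coordinates
    z = depth
    x = (u - cx) * z / fx                                 # back-project along camera x
    y = (v - cy) * z / fy                                 # back-project along camera y
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)    # (H*W, 3) point cloud
```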
Deep Reinforcement Learning for Autonomous Driving
Reinforcement learning has steadily improved and now outperforms humans in many
traditional games since the resurgence of deep neural networks. However, this
success is not easily transferred to autonomous driving, because real-world
state spaces are extremely complex, action spaces are continuous, and fine
control is required. Moreover, autonomous vehicles must also maintain
functional safety in complex environments. To address these challenges, we
first adopt the deep deterministic policy gradient (DDPG) algorithm, which can
handle complex state spaces and continuous action spaces. We then choose The
Open Racing Car Simulator (TORCS) as our environment to avoid physical damage.
We also select a set of appropriate sensor inputs from TORCS and design our own
reward function. To adapt the DDPG algorithm to TORCS, we design network
architectures for both the actor and the critic within the DDPG paradigm. To
demonstrate the effectiveness of our model, we evaluate it on different modes
in TORCS and report both quantitative and qualitative results.
Comment: no time for further improvement
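For readers unfamiliar with DDPG, the actor-critic setup mentioned above can be sketched as follows. The layer widths and the state/action dimensions are placeholders for illustration, not the network architecture actually used for TORCS.

```python
# Illustrative DDPG-style actor and critic for continuous control (PyTorch).
# Layer sizes here are assumptions for the sketch, not the authors' design.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),   # bounded continuous actions
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),                       # Q(s, a)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```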
GroundNet: Monocular Ground Plane Normal Estimation with Geometric Consistency
We focus on estimating the 3D orientation of the ground plane from a single
image. We formulate the problem as an inter-mingled multi-task prediction
problem by jointly optimizing for pixel-wise surface normal direction, ground
plane segmentation, and depth estimates. Specifically, our proposed model,
GroundNet, first estimates the depth and surface normal in two separate
streams, from which two ground plane normals are then computed
deterministically. To leverage the geometric correlation between depth and
normal, we propose to add a consistency loss on top of the computed ground
plane normals. In addition, a ground segmentation stream is used to isolate the
ground regions so that we can selectively back-propagate parameter updates
through only the ground regions in the image. Our method achieves the
top-ranked performance on ground plane normal estimation and horizon line
detection on the real-world outdoor datasets of ApolloScape and KITTI,
improving over the previous state of the art by up to 17.7% relative.
Comment: Camera ready for ACM MM 201
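The geometric consistency idea, as we read it from the abstract, can be sketched as below: average per-pixel normals over the predicted ground region in each stream, then penalize the angle between the two resulting ground-plane normals. The masked averaging and the cosine-distance form of the loss are assumptions for illustration, not the released implementation.

```python
# Sketch of a ground-plane normal consistency loss between the depth-derived
# and directly predicted normal streams, masked to the ground region.
import torch
import torch.nn.functional as F

def ground_consistency_loss(normals_from_depth, normals_direct, ground_mask):
    """normals_*: (B, 3, H, W) unit normal maps; ground_mask: (B, 1, H, W) in [0, 1]."""
    def ground_normal(normals):
        n = (normals * ground_mask).sum(dim=(2, 3)) / ground_mask.sum(dim=(2, 3)).clamp(min=1e-6)
        return F.normalize(n, dim=1)                       # (B, 3) unit ground normal
    n_depth = ground_normal(normals_from_depth)
    n_direct = ground_normal(normals_direct)
    return (1.0 - (n_depth * n_direct).sum(dim=1)).mean()  # 0 when the normals agree
```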
Image Labeling with Markov Random Fields and Conditional Random Fields
Most existing methods for object segmentation in computer vision are formulated
as a labeling task. This can in general be cast as a pixel-wise label
assignment problem, which closely resembles the structure of a hidden Markov
random field. In a Markov random field, each pixel can be regarded as a state
with a transition probability to its neighboring pixels, while the label behind
each pixel is a latent variable with an emission probability from its
corresponding state. In this paper, we review several modern image labeling
methods based on Markov random fields and conditional random fields, and
compare their results with those of classical image labeling methods. The
experiments demonstrate that introducing Markov random fields and conditional
random fields makes a significant difference in the segmentation results
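The MRF view described above can be made concrete with a toy example: each pixel carries a unary (emission) cost per label plus a Potts smoothness cost with its 4-neighbours, and labels are refined with iterated conditional modes (ICM). ICM appears here only because it is the simplest inference routine; the methods reviewed in the paper rely on stronger inference.

```python
# Toy MRF labeling: unary (emission) costs per label plus a Potts pairwise
# smoothness term, optimized by iterated conditional modes (ICM).
import numpy as np

def icm_labeling(unary, beta=1.0, iters=5):
    """unary: (H, W, L) per-pixel negative log-likelihoods; returns (H, W) labels."""
    h, w, num_labels = unary.shape
    labels = unary.argmin(axis=2)                  # unary-only initialization
    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                costs = unary[y, x].copy()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        # Potts term: pay beta for disagreeing with a neighbour
                        costs += beta * (np.arange(num_labels) != labels[ny, nx])
                labels[y, x] = costs.argmin()
    return labels
```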
CyLKs: Unsupervised Cycle Lucas-Kanade Network for Landmark Tracking
Across a majority of modern learning-based tracking systems, expensive
annotations are needed to achieve state-of-the-art performance. In contrast,
the Lucas-Kanade (LK) algorithm works well without any annotation. However, LK
relies on a strong photometric (brightness) consistency assumption on image
intensity and is prone to drift under large motion, occlusion, and the aperture
problem. To relax this assumption and alleviate the drift problem, we propose
CyLKs, a data-driven way of training Lucas-Kanade in an unsupervised manner.
CyLKs learns a feature transformation through CNNs, transforming the input
images to a feature space which is especially favorable to LK tracking. During
training, we perform differentiable Lucas-Kanade forward and backward on the
convolutional feature maps, and then minimize the re-projection error. During
testing, we perform the LK tracking on the learned features. We apply our model
to the task of landmark tracking and perform experiments on the THUMOS and
300VW datasets.
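The unsupervised forward-backward objective can be sketched as follows: track landmarks forward with a differentiable LK module on CNN features, track them back, and penalize the distance to the starting points. The `lk_track` interface below is an assumed placeholder, not the authors' implementation.

```python
# Sketch of a forward-backward (cycle) tracking loss; `lk_track` stands in for
# a differentiable Lucas-Kanade layer operating on CNN feature maps.
import torch

def cycle_lk_loss(lk_track, feat_t, feat_t1, points_t):
    """feat_*: (B, C, H, W) feature maps; points_t: (B, N, 2) landmarks at time t."""
    points_t1 = lk_track(feat_t, feat_t1, points_t)      # forward tracking t -> t+1
    points_back = lk_track(feat_t1, feat_t, points_t1)   # backward tracking t+1 -> t
    return torch.mean(torch.sum((points_back - points_t) ** 2, dim=-1))  # cycle error
```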
Rotational Rectification Network: Enabling Pedestrian Detection for Mobile Vision
Across a majority of pedestrian detection datasets, it is typically assumed
that pedestrians will be standing upright with respect to the image coordinate
system. This assumption, however, is not always valid for many vision-equipped
mobile platforms such as mobile phones, UAVs or construction vehicles on rugged
terrain. In these situations, the motion of the camera can cause images of
pedestrians to be captured at extreme angles. This can lead to very poor
pedestrian detection performance when using standard pedestrian detectors. To
address this issue, we propose a Rotational Rectification Network (R2N) that
can be inserted into any CNN-based pedestrian (or object) detector to adapt it
to significant changes in camera rotation. The rotational rectification network
uses a 2D rotation estimation module that passes rotational information to a
spatial transformer network to undistort image features. To enable robust
rotation estimation, we propose a Global Polar Pooling (GP-Pooling) operator to
capture rotational shifts in convolutional features. Through our experiments,
we show how our rotational rectification network can be used to improve the
performance of the state-of-the-art pedestrian detector under heavy image
rotation by up to 45
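The rectification step, warping features with a spatial transformer driven by a predicted in-plane rotation, can be sketched as below. This is a generic rotation-only spatial-transformer warp for illustration; it is not the authors' R2N module, and the GP-Pooling operator is not shown.

```python
# Sketch: resample a feature map with a rotation-only affine grid built from a
# predicted roll angle (a standard spatial-transformer warp).
import torch
import torch.nn.functional as F

def rectify_features(features, angle_rad):
    """features: (B, C, H, W); angle_rad: (B,) predicted in-plane rotation in radians."""
    cos, sin = torch.cos(angle_rad), torch.sin(angle_rad)
    zeros = torch.zeros_like(cos)
    theta = torch.stack([
        torch.stack([cos, -sin, zeros], dim=-1),
        torch.stack([sin,  cos, zeros], dim=-1),
    ], dim=1)                                            # (B, 2, 3) rotation-only affines
    grid = F.affine_grid(theta, features.shape, align_corners=False)
    return F.grid_sample(features, grid, align_corners=False)
```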
AutoSelect: Automatic and Dynamic Detection Selection for 3D Multi-Object Tracking
3D multi-object tracking is an important component in robotic perception
systems such as self-driving vehicles. Recent work follows a
tracking-by-detection pipeline, which aims to match past tracklets with
detections in the current frame. To avoid matching with false positive
detections, prior work filters out detections with low confidence scores via a
threshold. However, finding a proper threshold is non-trivial and requires an
extensive manual search via ablation studies. Moreover, the threshold is
sensitive to many factors, such as the target object category, so it must be
re-searched whenever these factors change. To ease this process, we propose to
automatically select high-quality detections, removing the effort needed for a
manual threshold search. In addition, prior work often uses a single threshold
per data sequence, which is sub-optimal in particular frames or for certain
objects. Instead, we dynamically search the threshold per frame or per object
to further boost performance. Through experiments on KITTI and nuScenes, our
method filters out false positives while maintaining recall, achieving new
state-of-the-art performance and removing the need for manual threshold tuning
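The contrast between one global confidence threshold and a per-frame threshold can be illustrated with a toy filter. The per-frame rule below (a fraction of each frame's top score) is only a stand-in heuristic; the selection mechanism proposed in the paper is automatic rather than hand-designed.

```python
# Toy detection filter: either one global threshold for all frames, or a
# threshold recomputed per frame (here, a fraction of the frame's top score).
def filter_detections(frames, global_thresh=None, frame_ratio=0.5):
    """frames: list of per-frame lists of (box, score) detections."""
    kept = []
    for dets in frames:
        if global_thresh is not None:
            thresh = global_thresh                           # fixed for every frame
        else:
            top = max((s for _, s in dets), default=0.0)
            thresh = frame_ratio * top                       # adapts to each frame
        kept.append([(b, s) for b, s in dets if s >= thresh])
    return kept
```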
End-to-End 3D Multi-Object Tracking and Trajectory Forecasting
3D multi-object tracking (MOT) and trajectory forecasting are two critical
components in modern 3D perception systems. We hypothesize that it is
beneficial to unify both tasks under one framework to learn a shared feature
representation of agent interaction. To evaluate this hypothesis, we propose a
unified solution for 3D MOT and trajectory forecasting which also incorporates
two additional novel computational units. First, we employ a feature
interaction technique by introducing Graph Neural Networks (GNNs) to capture
the way in which multiple agents interact with one another. The GNN is able to
model complex hierarchical interactions, improve the discriminative feature
learning for MOT association, and provide socially-aware context for trajectory
forecasting. Second, we use a diversity sampling function to improve the
quality and diversity of our forecasted trajectories. The learned sampling
function is trained to efficiently extract a variety of outcomes from a
generative trajectory distribution and helps avoid the problem of generating
many duplicate trajectory samples. We show that our method achieves
state-of-the-art performance on the KITTI dataset. Our project website is at
http://www.xinshuoweng.com/projects/GNNTrkForecast.
Comment: Extended abstract. The first two authors contributed equally. Project website: http://www.xinshuoweng.com/projects/GNNTrkForecast. arXiv admin note: substantial text overlap with arXiv:2003.0784
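The feature-interaction step that a GNN provides can be sketched with a generic message-passing layer in which each agent's feature is updated from its neighbours' features. This is an illustrative layer, not the specific GNN used in the paper.

```python
# Generic message-passing layer over per-agent features and a binary
# interaction graph (illustrative, not the paper's architecture).
import torch
import torch.nn as nn

class AgentInteractionLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.message = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, feats, adj):
        """feats: (N, D) per-agent features; adj: (N, N) float 0/1 interaction graph."""
        n = feats.size(0)
        pairs = torch.cat([feats.unsqueeze(1).expand(n, n, -1),
                           feats.unsqueeze(0).expand(n, n, -1)], dim=-1)  # (N, N, 2D)
        msgs = self.message(pairs) * adj.unsqueeze(-1)   # zero out non-edges
        agg = msgs.sum(dim=1)                            # aggregate incoming messages
        return self.update(torch.cat([feats, agg], dim=-1))
```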
Joint Object Detection and Multi-Object Tracking with Graph Neural Networks
Object detection and data association are critical components in multi-object
tracking (MOT) systems. Despite the fact that the two components are dependent
on each other, prior works often design detection and data association modules
separately, trained with separate objectives. As a result, one cannot
back-propagate the gradients and optimize the entire MOT system, which leads to
sub-optimal performance. To address this issue, recent works simultaneously
optimize detection and data association modules under a joint MOT framework,
which has shown improved performance in both modules. In this work, we propose
a new instance of joint MOT approach based on Graph Neural Networks (GNNs). The
key idea is that GNNs can model relations between variable-sized objects in
both the spatial and temporal domains, which is essential for learning
discriminative features for detection and data association. Through extensive
experiments on the MOT15/16/17/20 datasets, we demonstrate the effectiveness of
our GNN-based joint MOT approach and show state-of-the-art performance for both
detection and MOT tasks. Our code is available at:
https://github.com/yongxinw/GSDT.
Comment: Published in the International Conference on Robotics and Automation (ICRA), 2021. Code is released here: https://github.com/yongxinw/GSDT
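For context, the data-association problem such a tracker solves can be sketched as an affinity matrix between existing tracks and current-frame detections followed by a one-to-one matching. The cosine-similarity affinity below is a stand-in for the learned GNN features described in the abstract.

```python
# Sketch of track-detection association: affinity matrix + Hungarian matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats, det_feats, min_affinity=0.5):
    """track_feats: (T, D) and det_feats: (N, D) L2-normalized embeddings."""
    affinity = track_feats @ det_feats.T                  # (T, N) cosine similarities
    rows, cols = linear_sum_assignment(-affinity)         # maximize total affinity
    return [(r, c) for r, c in zip(rows, cols) if affinity[r, c] >= min_affinity]
```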
When We First Met: Visual-Inertial Person Localization for Co-Robot Rendezvous
We aim to enable robots to visually localize a target person through the aid
of an additional sensing modality -- the target person's 3D inertial
measurements. The need for such technology may arise when a robot is to meet
person in a crowd for the first time or when an autonomous vehicle must
rendezvous with a rider amongst a crowd without knowing the appearance of the
person in advance. A person's inertial information can be measured with a
wearable device such as a smartphone and can be shared selectively with an
autonomous system during the rendezvous. We propose a method to learn a
visual-inertial feature space in which the motion of a person in video can be
easily matched to the motion measured by a wearable inertial measurement unit
(IMU). The transformation of the two modalities into the joint feature space is
learned through the use of a contrastive loss which forces inertial motion
features and video motion features generated by the same person to lie close in
the joint feature space. To validate our approach, we compose a dataset of over
60,000 video segments of moving people along with wearable IMU data. Our
experiments show that our proposed method is able to accurately localize a
target person with 80.7% accuracy using only 5 seconds of IMU data and video.
Comment: Published in IROS 2020. Project website: http://www.xinshuoweng.com/projects/VIPL
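A contrastive objective of the kind described above can be sketched as follows: video-motion and IMU-motion embeddings of the same person are pulled together, while pairs from different people are pushed apart by a margin. The margin formulation is an assumption for illustration; the paper's exact loss may differ.

```python
# Sketch of a margin-based contrastive loss over paired video and IMU
# motion embeddings in a joint feature space.
import torch
import torch.nn.functional as F

def visual_inertial_contrastive_loss(video_emb, imu_emb, same_person, margin=1.0):
    """video_emb, imu_emb: (B, D); same_person: (B,) float, 1 if the pair matches."""
    dist = F.pairwise_distance(video_emb, imu_emb)          # L2 distance in joint space
    pos = same_person * dist.pow(2)                         # pull matching pairs together
    neg = (1 - same_person) * F.relu(margin - dist).pow(2)  # push mismatches apart
    return (pos + neg).mean()
```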